Why can't I reproduce benchmark scores from papers like Phi, Llama, or Qwen? Am I doing something wrong or is this normal?

I’m working on evaluating open-source LLMs (e.g., Phi, Llama, Qwen), and I’ve noticed that the benchmark scores I get are consistently different from the ones reported in their tech reports or papers — sometimes by a wide margin.

Sometimes the results are lower than expected, but surprisingly, sometimes they’re higher too. The point is that in many cases the gap is quite large, and it’s not clear why.

I’ve tried:

  • Using lm-eval-harness with its default settings (roughly as in the sketch below)
  • Matching tokenizers and prompt formats as closely as possible
  • Evaluating on standard benchmarks like MMLU, GSM8K, and ARC under the same few-shot conditions described in the reports
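For reference, this is roughly how I’m running things. It’s a minimal sketch assuming lm-evaluation-harness v0.4.x’s Python API; the model ID is a placeholder and exact task names or arguments may differ between versions:

```python
# Minimal sketch of my setup, assuming lm-evaluation-harness v0.4.x.
# The model ID and task names are placeholders; adjust for your versions.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                  # Hugging Face backend
    model_args="pretrained=meta-llama/Llama-3.1-8B,dtype=bfloat16",
    tasks=["mmlu", "gsm8k", "arc_challenge"],
    num_fewshot=5,                               # match the paper's few-shot count
    batch_size=8,
)

# Per-task metrics end up under results["results"].
for task, metrics in results["results"].items():
    print(task, metrics)
```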

Despite this, the scores I get are often significantly different from what’s published — and I can’t find any official scripts or clear explanations of the exact benchmarking setup used in those papers.

This seems to happen not just with one model, but across many open-source models.

Is this a common experience in the community?

  • Are papers using special prompt engineering or internal eval setups they don’t release?
  • Am I missing some key benchmarking tricks?
  • Is this just part of the game at this point?

Would really appreciate if anyone can share:

  • Experience trying to reproduce scores
  • Any evaluation tips
  • Benchmarking setups that actually match reported numbers

Thanks in advance!


Results can vary with the backend, or more precisely with the library version and the options passed to the generation function (such as temperature), so published scores can really only be used as a rough guide. Leaderboards are easy to compare because every model is evaluated under the same criteria within that leaderboard, but there aren’t many absolute indicators that hold across setups. For models from large companies, the output of the endpoints they officially provide can at least serve as a reference.
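To illustrate the point about generation options: something as small as the sampling configuration can change the completion, and therefore the extracted answer, on generative benchmarks like GSM8K. A minimal sketch with transformers, where the model ID and prompt are placeholders and the contrast between greedy and sampled decoding is the point:

```python
# Illustration of how decoding settings alone can change benchmark outputs.
# Model ID and prompt are placeholders; the greedy-vs-sampled contrast is the point.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "Q: A train travels 60 miles per hour for 2.5 hours. How far does it go?\nA:"
inputs = tok(prompt, return_tensors="pt").to(model.device)

# Greedy decoding: deterministic, what many harnesses use by default.
greedy = model.generate(**inputs, max_new_tokens=128, do_sample=False)

# Sampling at temperature 0.7: can produce a different reasoning chain,
# and therefore a different final answer, on some questions.
sampled = model.generate(
    **inputs, max_new_tokens=128, do_sample=True, temperature=0.7, top_p=0.9
)

print(tok.decode(greedy[0], skip_special_tokens=True))
print(tok.decode(sampled[0], skip_special_tokens=True))
```

The same applies to answer-extraction regexes and prompt templates: unless the paper releases its exact harness configuration, small differences like these can add up to several points either way.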